The annotation of the C-ORAL-BRASIL spoken corpus using an adaptation of the Palavras Parser
Authors
Abstract
This article describes the morphosyntactic annotation of the C-ORAL-BRASIL speech corpus using an adapted version of the Palavras parser. To achieve compatibility with annotation rules designed for standard written Portuguese, transcribed words were orthographically normalized, and the parsing lexicon was augmented with speech-specific material, phonetically spelled abbreviations, etc. Using a two-level annotation approach, speech flow markers such as overlaps, retractions and non-verbal productions were separated from running, annotatable text. In the absence of punctuation, syntactic segmentation was achieved by exploiting prosodic break markers, enhanced by rule-based distinctions between pause and break functions. Under optimal conditions, the modified parsing system achieved correctness rates (F-scores) of 98.6% for part of speech, 95% for syntactic function and 99% for lemmatization. Especially at the syntactic level, a clear connection between the accessibility of prosodic break markers and annotation performance could be documented.
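The two-level idea and the break-based segmentation described in the abstract can be illustrated with a minimal Python sketch. It is not the Palavras adaptation itself: the marker symbols (`//` for a terminal break, `/` for a non-terminal break, `[/n]` for a retraction, `<` and `>` for overlap boundaries, `hhh` for a non-verbal production), the function name `segment`, and the sample utterance are all assumptions made for illustration.

```python
import re

# Hypothetical marker inventory (an assumption, not the corpus's verified scheme):
#   "//"   terminal prosodic break, used here as a sentence-like boundary
#   "/"    non-terminal prosodic break
#   "[/n]" retraction, "<" and ">" overlap boundaries, "hhh" non-verbal production
TOKEN = re.compile(r"\[/\d*\]|//|[/<>]|[^\s/<>\[\]]+")
FLOW = re.compile(r"\[/\d*\]|hhh|[/<>]")
TERMINAL = "//"


def segment(transcript):
    """Split a transcript into units at terminal breaks.

    Each unit keeps two levels: the annotatable words, and the speech-flow
    markers stored as (word_index, marker) pairs so they can be restored later.
    """
    units, words, markers = [], [], []
    for tok in TOKEN.findall(transcript):
        if tok == TERMINAL:
            # A terminal break closes the current unit.
            if words or markers:
                units.append({"words": words, "markers": markers})
            words, markers = [], []
        elif FLOW.fullmatch(tok):
            # Flow markers go to the second level, anchored to the next word slot.
            markers.append((len(words), tok))
        else:
            words.append(tok)
    if words or markers:
        # Keep trailing material that has no closing terminal break.
        units.append({"words": words, "markers": markers})
    return units


if __name__ == "__main__":
    sample = "o Paulo [/1] o Pedro chegou / hhh ontem // <e> foi embora //"
    for unit in segment(sample):
        print(" ".join(unit["words"]), unit["markers"])
```

Only the word level would be handed to a parser in such a setup, while the stored positions allow the flow markers to be merged back into the annotated output afterwards.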
Similar resources
The C-ORAL-BRASIL I: Reference Corpus for Spoken Brazilian Portuguese
C-ORAL-BRASIL I is a Brazilian Portuguese spontaneous speech corpus compiled following the same architecture adopted by the C-ORAL-ROM resource. The main goal is the documentation of the diaphasic and diastratic variations in Brazilian Portuguese. The diatopic variety represented is that of the metropolitan area of Belo Horizonte, capital city of Minas Gerais. Even though it was not a primary g...
Grammatical Annotation of Historical Portuguese: Generating a Corpus-Based Diachronic Dictionary
In this paper, we present an automatic system for the morphosyntactic annotation and lexicographical evaluation of historical Portuguese corpora. Using rule-based orthographical normalization, we were able to apply a standard parser (PALAVRAS) to historical data (Colonia corpus) and to achieve accurate annotation for both POS and syntax. By aligning original and standardized word forms, our met...
Studying impressive parameters on the performance of Persian probabilistic context free grammar parser
In linguistics, a tree bank is a parsed text corpus that annotates syntactic or semantic sentence structure. The exploitation of tree bank data has been important ever since the first large-scale tree bank, The Penn Treebank, was published. However, although originating in computational linguistics, the value of tree bank is becoming more widely appreciated in linguistics research as a whole. F...
A Corpus-based Analysis of Collocational Errors in the Iranian EFL Learners' Oral Production
Collocations are one of the areas generally considered problematic for EFL learners. Iranian learners of English like other EFL learners face various problems in producing oral collocations. An analysis of learners' spoken interlanguage both indicates the scope of the problem and the necessity to spend more time and energy by learners on mastering collocations. The present study specifically f...
Automatic Adaptation of Annotation Standards for Dependency Parsing: Using Projected Treebank as Source Corpus
We describe for dependency parsing an annotation adaptation strategy, which can automatically transfer the knowledge from a source corpus with a different annotation standard to the desired target parser, with the supervision by a target corpus annotated in the desired standard. Furthermore, instead of a hand-annotated one, a projected treebank derived from a bilingual corpus is used as the sou...